\chapter{The Data Lifecycle}

\section{Introduction}
The data lifecycle:
acquire data $\rightarrow$ store data $\rightarrow$ explore data
$\rightarrow$ clean data $\rightarrow$ analyze data $\rightarrow$
store results $\rightarrow$ archive data.

This translates into the following activities:

Raw data $\stackrel{\mbox{data
    munging/scraping/wrangling/cleaning}}{\rightarrow}$
clean data $\stackrel{\mbox{EDA}}{\rightarrow}$ clean data
$\stackrel{\mbox{ML algorithms/data mining
algorithms}}{\rightarrow}$ data analysis
$\stackrel{\mbox{visualize/build reports/communicate}}{\rightarrow}$
final result.
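The wrangling and EDA steps above can be sketched in a few lines of Python; the toy in-memory CSV, the column names, and the cleaning rule (drop rows with a missing value) are illustrative assumptions, not part of the text:

```python
# Minimal sketch of raw data -> cleaning -> EDA, standard library only.
# The CSV content and column names are hypothetical.
import csv
import io
import statistics

# "Acquired" raw data: note the stray whitespace and the missing value.
raw = io.StringIO("name, age\nalice, 34\nbob, \ncarol, 29\n")

# Wrangling: parse, strip whitespace from keys and values,
# then drop rows with a missing age.
rows = [{k.strip(): v.strip() for k, v in r.items()}
        for r in csv.DictReader(raw)]
clean = [r for r in rows if r["age"]]

# EDA: a first summary statistic on the cleaned data.
ages = [int(r["age"]) for r in clean]
print(len(clean), statistics.mean(ages))  # 2 rows survive; mean age 31.5
```

Each arrow in the diagram corresponds to one small, inspectable transformation; in practice each would read from and write to the storage layer.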

Note that, even before the first step, there are decisions about what,
when, and how to collect. Even raw data is not that raw!

As a side effect of each step, we should produce {\em metadata} to
document it. This makes the results of an analysis repeatable and
understandable.
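One lightweight way to produce such metadata is to emit a small JSON record alongside each step's output. This is only a sketch: the step names, field names, and file names below are made-up assumptions, not a prescribed schema.

```python
# Hedged sketch: one JSON metadata record per lifecycle step.
# Field names and file names are illustrative assumptions.
import datetime
import json

def make_metadata(step, input_desc, output_desc, notes=""):
    """Build a metadata record documenting one lifecycle step."""
    return {
        "step": step,
        "input": input_desc,
        "output": output_desc,
        "notes": notes,
        "timestamp": datetime.datetime.now(
            datetime.timezone.utc).isoformat(),
    }

# Hypothetical cleaning step and file names.
meta = make_metadata("cleaning", "raw_survey.csv", "clean_survey.csv",
                     notes="dropped 12 rows with missing age")
print(json.dumps(meta, indent=2))
```

Writing such a record next to each intermediate file is enough to reconstruct, much later, what was done to the data and in what order.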

\section{Model Fitting}
All methods have limitations, and it may not be clear when to use each
method---or, once a method is applied, how to evaluate how good the
model is.

Limitations:
\bi
\item mean is very sensitive to incorrect data and outliers. Likewise,
  correlation measures are also not stable.
\item It is hard to find good clusters without an estimate of the
  number of clusters and the relative size of those clusters. If there
  are two clusters in the data, one large (90\% of data), one small
  (10\% of data), most algorithms will have trouble. K-means with
  random initialization will need to be started with at least 10
  clusters to detect those two. 
\item logistic regression work well when the appropriate number of
  relevant parameters is used. Two few parameters, or irrelevant ones
  may lead to a bad fit even if there is a logistic regression in the
  data. 
\item Linear regression can be abused, as it will produce a 'good'
  fit (reasonable goodness of fit measure with something like
  R-square) even when the data is not linearly related. Assumptions
  about data distribution or error distribution (normal) must be
  checked.
\item Dependence on parameters: clustering depends largely on the idea
  of 'distance' used.
\item Spurious correlations: data series where elements move together
  over a period of time are relatively easy to find.
\ei
